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Abstract 

For a genomically unstable cancer, a single tumour biopsy will often contain a mixture of 
competing tumour clones. These tumour clones frequently differ with respect to their genomic 
content (copy number of each gene) and structure (order of genes on each chromosome). Modern 
bulk genome sequencing mixes the signals of tumour clones and contaminating normal cells, 
complicating inference of genomic content and structure. We propose a method to unmix 
tumour and contaminating normal signals and jointly predict genomic structure and content of 
each tumour clone. We use genome graphs to represent tumour clones, and model the likelihood 
of the observed reads given clones and mixing proportions. Our use of haplotype blocks allows 
us to accurately measure allele specific read counts, and infer allele specific copy number for 
each clone. The proposed method is a heuristic local search based on applying incremental, 
locally optimal modifications of the genome graphs. Using simulated data, we show that our 
method predicts copy counts and gene adjacencies with reasonable accuracy. 


1 Introduction 

Human cells have evolved DNA repair mechanisms to mitigate the effects of DNA breakage during 
transcription and replication. During the lifetime of many cancers, one or more of these mechanisms 
will be compromised. With DNA repair compromised, DNA breakages will either go unrepaired 
or will be repaired by a less accurate but still functional mechanism. Improper repair of DNA 
breakage events will result in structural chromosomal aberrations in descendent tumour cell lineages. 
Structural aberrations can then lead to further problems during mitosis, with incorrect segregation 
of DNA to daughter cells resulting in numerical chromosomal aberrations [I]. If both daughter 
cells are viable, incorrect segregation leads to a divergence in chromosomal content between the 
two descendent lineages. The progressive acquisition of structural and numerical chromosome 
aberrations is referred to as genomic instability. 

A direct consequence of genome instability is intra-tumour heterogeneity [Tj. A sample of 
a biopsy from a genomically unstable cancer will contain tumour cells from lineages that have 
diverged in the structure and content of their chromosomes. Significant changes do not usually 
out-pace cell division. Thus collections of cells, referred to as clones, will be genomically similar. 
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Bulk sequencing of such a sample mixes the signals from tumour clones and contaminating normal 
cells. An important problem in cancer genomics is the unmixing of these signals, and reconstruction 
of the structnre and content of the genomes of each clone. The key difficulty of the problem is that 
mixing dilutes the signal of the changes of interest, often to a level approaching that of the noise 
in the data. 

Existing methods focus on accurate modeling of the number of the copies of each reference 
genome segment in the sequenced tumour clone or clones. A simple model of genome structure 
predominates for most tools: segments in the model are adjacent only if they are also adjacent 
in the reference genome. An additional copy of a segment is implicitly modeled as a copied and 
truncated chromosome, when in reality it may be a tandem duplication resident on the original 
chromosome. Theta and Theta2 [lOt m infer the copy number of tumour clone genomes and 
mixing proportions of tumour clones and contaminating normal cells. Both tools assume a-priori 
knowledge of large segments of the genome with identical clone specific copy number, and model 
adjacent segments independently. Titan [6] uses an HMM to model spacial correlation between 
segments adjacent in the reference genome, however the state space of the HMM is restricted to 
allow only one aberrant genotype per segment. A similar method, CloneHD [5j uses an factorial- 
HMM with a more comprehensive state space. 

Simplihed models of connectivity are reasonable for genomic prohles using array based tech¬ 
nologies. With whole genome sequencing, tumour specihc adjacencies, or breakpoints, are readily 
available and can be predicted with reasonable accuracy using a variety of tools. Breakpoints pro¬ 
vide the potential for a more comprehensive model of genome structure that includes long range 
connectivity between genomic segments. A important question in computational biology is the ex¬ 
tent to which a more comprehensive model of genome structure has the potential to improve copy 
number inference. Furthermore, a method that integrates both copy number and breakpoints could 
provide additional information about each breakpoint: whether the breakpoint is real or a false 
positive, the prevalence of the breakpoint in the clone mixture, and the number of chromosomes 
harboring the breakpoint per clone. 

Some progress has been made on more comprehensive modeling of genome structure in tumour 
clones. [8] proposes an algorithm to infer missing adjacencies in a mixture of rearranged tumour 
genomes, however they do not model copy number. [12] proposes a framework for sampling from 
the rearrangement history of tumour genomes. |in) proposes PREGO, a method for inferring the 
copy number of segments and breakpoints using a genome graph based approach, though they do 
not model normal contamination or tumour heterogeneity, limiting applicability of their method to 
real tumours. 

Additional progress in the ability to infer divergent genome structure remains relevant to cancer 
research. Subclonal copy changes (changes in a subset of clones) are difficnlt to assess with single 
cell methods, whereas coincident subclonal breakpoints could be more easily assessed for their 
presence in individual cells. Given multiple samples, an ability to identify subclonal events will 
enable more accurate tracking of complicated patterns of metastasis, such as sample heterogeneity 
produced by multiple-metastases to the same anatomic site |2|. Furthermore, subclonal changes 
represent contemporary events, whereas ubiquitous changes represent historical events. The ability 
to discern historical from contemporary breakpoints may provide insight into the repair mechanisms 
that have been active historically versus those that were active at the time the tumour was sampled. 

We propose a method for joint inference of genomic content and structure given tumour sequenc¬ 
ing data and a set or predicted breakpoints. Our method is built upon two important assumptions. 
First we assume that intelligent aggregation can be used to gain additional statistical strength, and 
increased power to detect changes in a minor tumour clone. Thus, we choose a larger segmentation 
than competing methods. To avoid the additional noise associated with a true copy number change 
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occuring in the middle of a segment, we augment our segmentation with breakends of predicted 
breakpoints, with the intention of capturing the majority of true copy number changes across the 
genome. Furthermore, we use counts of reads for alleles of each haplotype block, rather than for 
alleles of each heterozygous SNP, increasing statistical strength for inference of allele specific copy 
number changes. Second, we assume that a more comprehensive model of genome structure that 
includes the long range connectivity implied by rearrangement breakpoints will improve inference 
accuracy. We model the likelihood of observing the sequencing data given a mixture of genome 
graphs, where each genome graph represents the content and structure of the genomes of tumour 
clones. Our method can be considered a natural extension of the factorial-HMM used by cloneHD, 
which can be thought of as a mixture of HMMs for modeling the clone mixture. 

We show using simulated genomes that our method out-performs Titan, Theta2 and CloneHD 
for inference of clone specific copy number. We also compare our method against two breakpoint 
naive methods for inferring copy number of segments and breakpoints; a factorial HMM, and a 
model assuming indepenence between segments. For the breakpoint naive methods, we post-hoc 
assign breakpoint copy number based on segment copy number. We show that integration of 
breakpoints improves inference of segment copy number given the same data, and that post-hoc 
assignment of breakpoint copy number is less accurate than using an integrated model. 

2 Problem Definition 

Assume as given whole genome sequence data from tumour and matched normal samples. Suppose 
we have predicted a set of tumour specific rearrangement breakpoints with reasonable accuracy. 
We aim to predict the following: 

1. mixture proportions of tumour clones and normal cells 

2. clone and allele specific copy number of genomic segments 

3. clone specific copy number of rearrangement breakpoints 

The strongest signal of copy number change is from segment specific differences in counts of 
reads aligned concordantly to the reference genome. We thus model the likelihood of concordant 
read counts for large segments. A likelihood model on its own will over-fit to the data, thus we also 
impose a structure on the clone specific copy number by modeling each clone as a genome graph 
in a mixture of genome graphs. A more detailed description of the problem is provided below. 

2.1 Mixtures of Genome Graphs 

Assume as input a set of alignments of uniquely mapped concordant reads, and a set of putative 
breakpoints predicted from discordant reads. A segmentation of the genome is not assumed. We 
segment the genome using a relatively large segment length (SMB for this study), and augment 
the set of segment boundaries with breakends of rearrangement breakpoints. More formally, let V 
represent the space of all reference nucleotides, with each nucleotide represented as a chromosome 
position pair. Impose an arbitrary order on the chromosomes thus making elements of V sortable. 
Each concordant read alignment r = (x, y); x, y G V is a pair of positions representing the start 
and end of the alignment of the read in the genome. Assume a set of breakpoints B have been 
identified from analysis of the discordant reads. Each breakpoint b £ B] b = {vj,Vk)', Vj,Vk £ W 
is a pair of segment extremities putatively adjacent in a tumour chromosome, but not adjacent in 
the reference genome. 
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Let S' be a set of N segments defined on a set of 2 N ordered segment extremities W C V. 
Segment extremities include boundaries of regular segments and breakends {^{vj,v^.)&B{vj,Vk})■ 
Segment G S is defined by a pair of extremities Sn = {v2n-i,V2n) such that V2n-i < V2n- The 
length of segment n is thus In = V2n — V2n-i + 1 - Let A be the set of reference adjacencies, such 
that a = {v2n,V2n+i) G ^ if and only if V2n and V2n+i are on the same chromosomes. We assume 
the segmentation covers the entire genome, thus £ A] a = {v2n, V2n+i) ■ V2n + 1 = V2n+i- 

In construction of the genome graph representation of rearranged tumour genomes we will need 
an artihcial construct to represent tumour specific chromosome ends which we call telomeres Q 
Let sat+i = (u2Ar+i, U27V+2) be a dummy telomere segment between two dummy extremities V2N+1 
and V2N+2- Dehne a telomere as an adjacency between a real segment extremity and either V2N+1 
or V2N+2- Define the space of all telomeres T as the edges of a complete bipartite graph between 
vertex set W and vertex set {v2n+i-,V2n+2}■ 

Let vertex set V = W U {v2n+i,V2N+2} be the set of segment extremities plus 2 additional 
dummy telomere vertices. In defining the genome graph, we will deliberately use sets of position 
pairs interchangeably with their representative edges for A, B, S and T. Define the genome graph 
H = {V, E = (S', Q)) ;Q = AuBUTasa bi-edge-colored graph on vertex set V, where segment 
edges in S are given a color distinct from bond edges in Q (represented as thick dashed and thin 
solid respectively in Figure [^. Bond edges have 3 classes, reference (edge set A), breakpoint (edge 
set B) and telomere (edge set T). 



Figure 1: A genome graph on 6 regular segments. Segments edges (black, thick dashed) 
connect vertices representing segment extremities. Reference bond edges (yellow, thin solid) 
connect segments to recapitulate the reference chromosomes. Breakpoint bond edges (red, 
thin solid) represent putative connections between segment ends as identified through anal¬ 
ysis of discordant sequencing reads. A dummy segment edge (bottom) is used to connect 
the end points of linear chromosomes via telomere bond edges (blue, thin solid). Telomere 
bond edges form a complete bipartite graph on the pair of vertices incident to the dummy 
segment edge, and all other vertices. 

Dehne a linear chromosome as a sequence of (possibly repeated) oriented segments, and a 
circular chromosome as a cycle of (possibly repeated) oriented segments. Thus a linear chromosome 
is an alternating walk in H starting and ending with a segment edge, and a circular chromosome is 
an alternating tour in H. For algorithmic convenience, we will model all chromosomes as alternating 
tours. An alternating walk that represents a linear chromosome can be transformed into a tour 
by connecting each end of the chromosome to opposite ends of the dummy telomere segment edge 
using telomere bond edges. A telomere bond edge is considered observed if it is incident to a vertex 
representing the end of a reference chromosome, with remaining telomere bond edges considered 
unobserved. All reference and breakpoint bond edges are considered observed. 

^acknowledging the biological inaccuracy of applying the term telomere 
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A genome can be represented exactly as a collection of alternating tours in H, though such 
a representation would be unidentifiable based only on adjacency information provided by whole 
genome sequencing. Instead, a collapsed representation is more convenient. Define a genome 
instance g : E ^ N as an assignment of counts to edges in H. A genome instance is valid if and 
only if there exists a collection E of alternating tours in H for which each edge e appears g{e) 
times in T- Alternatively, let S{v) and Q{v) be segment and bond edges incident with vertex 
V respectively. A genome instance is valid only if copy number balance condition holds 1 [lOj. 
Equation . 

Vu e E : g{e) = Y ^(e) ( 1 ) 

eGS{v) e&Q{v) 

Call a genome instance that obeys the copy number balance condition as balanced. A genome 
instance g may be balanced but not valid if for some edge e, g{e) < 0 . Balanced genome instances 
will be important as modifications of valid genome instances. 

Define a genome collection ^ as a collection of M genome instances. Let Rm be the number 
of reads contributed by cell population m in a heterogeneous tumour sample. We assume each 
of the Rm reads are sampled uniformly from the length Lm of the genome of cell population m. 
Thus, each cell population contributes a specific number of reads per nucleotide, hm = to the 
sequencing experiment. We call hm the haploid read depth since it is the read depth contributed 
by a single copy of a segment. Define a genome mixture as a pair (Q,h), where ^ is a collection of 
genomes and h is a vector of haploid read depths, one per genome. 

The expected read count of segment n will thus depend on: a) the length of the segment b) 
the copy number of the segment in each population c) the haploid read depth of each population. 
Observed read counts will be subject to sampling error, thus we can form a likelihood of expected 
read counts given observed read counts. 

Our goal is to identify a genome mixture that: 

1. maximizes the likelihood of the observed read counts 

2 . minimizes the number of unobserved telomere bond edges used by each tumour genome 

2.2 Allele Specific Genome Graph 

We model the copy number of each parental allele for each segment. Complete loss of a parental 
allele, termed Loss Of Heterozygousity (LOH), is of particular biological interest as such events 
frequently occur as part of a ‘double hit’ targeting a tumour suppressor gene. A method that 
infers the specific copy number of each allele will enable us to identify such biologically important 
events. Furthermore, many previous methods have shown increased performance when modelling 
allele specific versus total copy number. 

We define the allele specific genome graph H' to jointly model both genome structure and allele 
specific copy number. Construct H' = {V',E'), E' = S' IdQ', from genome graph H by duplicating 
edges and vertices for alleles 1 and 2 . For each vertex Vn & V create two vertices Un,i, Vn,2 in V. For 
each segment edge Sn = {v2n-i,V2n) £ S, create two segment allele edges = {v2n-i,i,V2n,i) and 
Sn,2 = iv2n-i,2, V2n,2) in S'. For each bond edge e = {vj,Vk) G Q, create four bond edges {vj^i,Vk,i), 
{vj,i,Vk^2)^ {'Vj,2,Vk,i), {vj,2-,Vk,2) in Q'■ Subsequent sections will refer to the allele specific genome 
graph and its edges and vertices as H, E = S U Q and V respectively. 
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2.3 Haplotype Allele Specific Read Counts 

Heterozygous germline SNPs can be used to classify reads (or previously array probe intensities) 
into one of the two parental alleles. Classification into paternal/maternal is impossible without 
parental DNA, and somewhat irrelevant in the context of this work. Thus parental alleles and 
associated read counts are distinguished from each other as major and minor according to which 
allele has more copies, or which class of reads has more counts. 

Considerable power to estimate allele specific copy number can be gained by using haplo¬ 
type information. Let the Xi G { 0 , 1 } be a binary indicator representing the two possible al¬ 
leles of heterozygous SNP i. A haplotype bloek h = {i,k,y),y G { 0 , 1 }^ is a sequence of al¬ 
leles {xi = yi,Xi+k-i = Uk) for k consecutive SNPs starting at i, where the sequence of al¬ 
leles exist consecutively on the same physical chromosome. The alternate haplotype bloek h = 
{xi = yi, ..,Xi+k-i = Vk) represents the other of the two parental chromosomes (here x = 1 — x). 

Major and minor read counts can be grouped by haplotype block for increased statistical 
strength. Specifically, call a read r as non-conflicting with h = {i, k, y) if for all j G {i,.., z -t- A; — 1 } 
read r matches allele = yj. Call a read r as supporting of h if it is non-conflicting, and 

contains at least one SNP j from j G {z,..,z-|-/c — 1 }. Let and be counts of reads that support 
h and h respectively. Assuming the haplotypes are correct, Zh and zj^ will provide more accurate 
representations of allele specific copy number than counts of reads for individual SNPs. We infer 
haplotype information using 1000 genomes data and shapeit 2 [3j. 


2.4 Genome Mixture Likelihoods 


As described above, assume ^ is a collection of M genomes, indexed by m, and genome instance 
gm is a mapping g ■. E ^ assigning a copy number to each edge in the allele specific genome 
graph H. For convenience, let matrix Cn denote the clone specific copy number state of segment 
n, such that Cnme denotes the copy number of segment n, genome m, allele i. In other words, 
Cnmi = 9m{snd for Segment allele Sn^■ 

Observed data X G are per-segment total and allele specific read counts. For segment 

n, Xu'S is the count of concordantly aligning reads contained within segment n, and Xni and Xn 2 
are the counts of the subsets of those reads classified as from the major or minor allele (fc = 1 and 
k = 2 respectively). As described above, alleles are distinguish as major/minor based on which 
allele has higher read count, though this does not necessarily relate to which allele has more copies 
in individual tumour clones. Modeled alleles i = 1 and i = 2 correspond with measured major 
(fc = 1) and minor {k = 2 ) alleles respectively. 

We use haplotype blocks to accurately calculate major and minor read counts. First, haplotype 
blocks spanning segment boundaries are split into a separate block per spanned segment. Second, 
we calculate supporting read counts Zh^ , and zj^ . for each haplotype block i in segment n. Allele 
specific read counts Xni are calculated as given by Equation 


^nk 


J2max{zh^^, z-^^ J ifk = l 

i 

^min(z/*„ J iffc = 2 
. i 


( 2 ) 


Let Pntk represent, for segment n, the proportion of reads from allele i that can contribute to 
measurement k, assumed known. For total read counts {k = 3 ), pnk£ = I ioi i G { 1 ) 2 }, since 
total read count includes all reads from both alleles. For major and minor read counts, calculate 
the proportion of reads that can be genotyped by heterozygous SNPs for segment n as given by 
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Equation!^ Since reads from allele l = \ can not contribute to measured read count Xn 2 -, and visa 
versa, puM = 0 for k t.,k G {1,2}. Equationfully specifies Pnik- 




Pnki 


Xfii Xn2 

^nS 

(pn k = i,k ^ {1, 2} 
< 1 if /c = 3 
0 else 

\ 


(3) 

(4) 


Let ^nk £ l^>o model expected major/minor/total read count, calculated as given by Equation]^ 

(5) 


f^nk — EE hm CnmiPnki 

m I 


We model the likelihood of allele specific or total read counts x given expected read counts 
/r using either a Poisson (Equation]^ or Negative Binomial (Equation]^. The over-dispersion 
constant for the Negative Binomial is estimated off-line (Section |6.2[ ). See Section 6.3 for a discussion 
of the independence assumption in comparison to similar models. 


p{x\n) 

p{x\^,r) 


x\ 

r{r + x) ! r f P 
r{x)r{r) \r + n) \r + 


( 6 ) 

(7) 


2.5 Prior Probability over Genome Structure 

Positive copy number assigned to a bond edge e by genome instance g-m implies edge e ‘exists’ in 
the genome mixture. Observed bond edges are assumed to have higher prior probability of existing. 
Thus, we place a prior probability over the number of unobserved edges used by genome instances 
in a genome mixture. Let U C Q he the set of unobserved bond edges in H. Calculate a prior for 
the copy number of genome instance g as given by Equation 

P{g\l3) oc (8) 

e£U 

In log space the above prior amounts to a fixed penalty on each additional copy of an unobserved 
bond edge. Such a prior prevents over-fitting of the genome structure, such as a genome instance 
that assigns positive copy number to many telomere edges in order to be able to fit each segment 
edge perfectly to the segment likelihood. Thus the impact of Equation is similar to that of a 
transition matrix in an HMM for modeling spacial correlation, and in fact the genome graph model 
with no breakpoints has an equivalent representation as an HMM. Higher values of /3 have the effect 
of smoothing over false positive deviations in predicted copy number that result from sampling error 
of identical true copy number. 


2.6 Maximum Posterior Genome Mixtures 

Our overall objective is to identify the genome mixture {G,h) that maximizes the full posterior 
(Equation]^ given /3. 

piG,h\X,(3) oc p{X\g,h)l[P{g\/3) (9) 

a 
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3 Method 


3.1 Method Overview 

We separate the maximum posterior genome mixture problem into two subproblems: 

1. learn h 

2. infer Q given h 

For problem 1, we remove breakpoint edges from the genome graph, reducing the graph to a 
hidden markov model (HMM), making learning of h more tractable. For problem 2, we use an 
approximate combinatorial method to infer Q given h, initializing at the results of running the 
viterbi algorithm on the HMM used to learn h. 


3.2 Expectation Maximization Method for Learning h 


We learn h using the Baum Welch algorithm for learning the parameters of a Hidden Markov Model 
(see Section 6.1). In brief, we use expectation maximization to find a value a local maxima of the 
marginal likelihood function (Equation |10[). 


L{X\h) = E p{X,C\h) 
c 


( 10 ) 


At iteration t, the algorithm calculates the that maximizes the expected value of the com¬ 
plete data log likelihood p{X,C\h) with respect to the conditional distribution p{C\X,h^^~^\ ■). 
We use multiple restarts with different initial in an attempt to discover the global maximum. 


3.3 Combinatorial Method for Inferring Q 


We propose a greedy algorithm for inferring the full structure of the genome collection Q given h 
estimated from large segments. Given a current solution our aim is to select from a set of 
possible modifications that a) are simple to calculate, b) are comprehensive enough to escape local 
optima, c) produce a valid genome collection when applied to A valid genome collection 

is defined as a genome collection for which all genomes g € G have positive copy number and obey 
the copy number balance condition. 

Modeling the likelihood of total reads introduces a dependency between the copy number of 
each allele of the same segment, complicating inference. For the purposes of inferring Q, we model 
only the likelihood of allele specific read counts, allowing for the independent modeling of the alleles 
of each segment. Let Ce be a vector of per clone copy numbers assigned to edge e for each of the 
m genome instances, such that Cem = fi'm(e). Suppose edge Sne G S represents segment n, allele i. 
The likelihood of allele specific read counts Xnk where k = i can be written as p{xnk\p-ni) where 
Pni is calculated as given by Equation [TT| 




E 


n' ^m'-emH^n 


( 11 ) 


We define the objective function of our greedy heuristic as given by Equation 13 


f{Q) = - logpig\X,h,-) 

e£E 

. . ^ _ \ -log p{Xnk\gne) if e = G S' 

■ lE™/3Wr.(e) if eel/ 


( 12 ) 

( 13 ) 




3.3.1 Representing modifications as edge disjoint sets of alternating cycles 

We propose an algorithm for identifying a modification of optimal with respect to /, from a 
restricted set of modifications that increase or decrease the copy number of any edge by at most 1 . 
Consider valid genome instance gi and balanced genome instance qa ■ The set of balanced genome 
instances is closed under addition and subtraction. Thus 92 = 9i + 9A will be a balanced genome 
instance, and if 52 (e) = 51 (e) + gA{e) > 0 Ve, then 52 will also be valid. 

An obvious set of candidate gA include all gA that set 5zi(e) = 1 (or gA{&) = —1) for all 
edges e in some alternating cycle of H. Such gA would be analogous to adding (or subtracting) a 
circular chromosome to the genome instance. However, it is trivial to show for such a restricted 
set of modifications, a greedy algorithm would easily get stuck in local optima. In Figure [ 2 ^, 
for instance, transformation of the genome instance on the left to that on the right would not 
be possible without an intermediate state that assigns 2 copies to each segment edge. Such an 
intermediate state may be prohibitively unlikely if there is considerable evidence for the segment 
to be copy number 1 . 



Figure 2: Genome graph modifications. Thick dashed lines show segments edges, thin solid 
lines show bond edges. A) Transformation of one genome instance to another via addition 
of a modification. Edges are annotated with copy number for edges with non-zero copies. 

Note that a transformation performed in two steps, one addition and one subtraction, would 
likely result in a highly improbable intermediate state with segment edge copies set to 2. 

B) Representation of the modification as an alternating cycle in the genome graph. 

Instead, we require candidate gA that both add an subtract edges, as swapping one set of edges 
for another set will allow easier movement through the space of possible genome collections. Let 
5+ be a balanced genome instance constructed such that 5+(e) = I for all edges in a set of edge 
disjoint alternating cycles in H. Note that since each vertex is incident to a single segment edge, 
edge disjoint alternating cycles are also vertex disjoint. Let 5_ be defined similarly to 5+, with 
5_(e) = —I for a distinct set of edge disjoint alternating cycles in H. Let gA = 9+ + 9-- Call gA 
thus constructed as a simple genome modification. 

Proposition 1. Any simple genome modification has a representation as an edge disjoint set of 
alternating cycles in a specific hi-edge-colored graph derived from H. 

Proof. Edges in gA will have copy number in {—1,0,-|-1}. Remove all 0 edges from gA- Assign a 
color to each remaining edge as follows: 

• segment edge e with gA{&) = +!—)• red 

• segment edge e with 5/i(e) = — 1 —)■ blue 

• bond edge e with gA{e) = +1 —)• blue 
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• bond edge e with ^/^(e) = — 1 —>■ red 

Now consider any vertex v € H. Let s be the segment edge incident with v, and let Cj be the 

bond edge incident with v. Suppose gA{s) = 0, but there exists i such that gAi&i) / 0. Then 
there must exist ii,i 2 such that gA{eii) = +1 and gA{eii) = — 1 , otherwise g+ and g- would not 
both be vertex disjoint. Furthermore, will be colored blue and Cjj will be colored red. Suppose 
9 z\('S) = +1. Then there exists a single bond edge incident with v such that gA{&i) = +1 by the 
copy number balanced condition and the requirement that 5 + be vertex disjoint. Edge s will be 
colored red and Cj will be colored blue. A similar analysis for a segment edge s with gA{s) = — 1 
shows that s will be colored blue and a single incident bond edge will be colored red. Thus at 
any vertex, either 0 or 2 incident edges will have non-negative copy number. Furthermore, for 
vertices with 2 incident non-negative copy number edges, the color of the edge as defined above 
will alternate. Thus gA can be represented as a set of vertex disjoint alternating cycles. □ 

We now show how to construct a graph F called the genome modification graph (Figure]^, 
such that any vertex disjoint alternating cycle in F represents a valid simple genome modification. 
Add a single copy of each vertex in Ff to E and two copies of each edge in H to F. Label one 
duplicated edge as and the other duplicated edge as ‘-1’. Color each edge as follows: 

• segment edge e labeled -|-1 —>■ red 

• segment edge e labeled — 1 —)• blue 

• bond edge e labeled -|-1 —)• blue 

• bond edge e labeled — 1 —)■ red 

B 


Figure 3: A) A small subgraph of a genome graph with a subset of edges is shown. B) 
Transformation of the genome graph to a genome modification graph. Each edge is dupli¬ 
cated and labelled with ‘-I-’ or Segment edges are colored red for label ‘-I-’ and blue for 
label Bond edges are given an opposite labelling: blue for label ‘-I-’ and red for label 



A 



Proposition 2. An edge disjoint set of alternating eyeles C in F represents a valid simple genome 
modification. 

Proof. A transition from -|-1 labeled edges to —1 labeled edges (or visa versa) in C only occurs for 
two adjacent bond edges. Let W be the set of vertices with 2 incident bond edges from C. The size 
of W will be even since the number of transitions must be even. Furthermore, it is always possible 
to construct a set of edge disjoint paths V connecting pairs of vertices in W. Let u be a vertex 
in W, and s = {v,u) the incident segment edge. If rt G W, then s is a one edge path connecting 
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v,u € W. If u 0 W, then there must exist another vertex v' ^ W with incident segment edge 
s' = {v',u'), and u' 0 W. Then (u,tt') is a bond edge not in C and {v,u), {u,u'), {u\v') is a three 
edge alternating path connecting v,u G W. Now edges in C labeled +1 can be seen as edges unique 
to a genome instance g+, edges in C labeled —1 can be seen as edges unique to a genome instance 
g -, and edges in C are edges common to both g^ and g-. □ 

Proposition 3. Any simple genome modification can be represented as an edge disjoint alternating 
cycle in F. 

Proof. Follows directly from Proposition □ 

Note that simple genome modifications involving only bond edges are equivalent to a balanced 
rearrangement in a breakpoint graph. 


3.3.2 Selecting an optimal simple genome modification 

For a collection of M genomes, consider the set of edge copy number modifications T^, where 
T = { —1,0,+1}. Let Z\ G 7 ^ 0 be a specific modification affecting at least one of the M 

genomes. Let ^ G represent the acceptance of the modification or its inverse for each edge in E. 
Thus ZeA represents modihcation by A, —A or 0 for Ze equal to +1, —1 and 0 respectively. Write 
the objective of a locally optimal modification given A as shown in Equation |14[ 


argmin ^ /e(ce + ZgA) 
e£E 

S.t. ZeA = ZeA Vv G V, 


e&S( 

v) 

e£Q{v) 

1 > 

E 

\Ze\ Vu G E, 


e&S{v) 


1 > 

E 

kel Vu G E 


e£Q(v) 



(14) 

(15) 

(16) 
(17) 


The first constraint (Equation 15) is simply the copy number balance condition. The second 
and third constraints (Equations 16 and [Tt]) ensure that the set of alternating cycles represented 
by z is vertex disjoint, and is thus a simple genome modification. 

To identify the locally optimal modification, first create genome modification graph F, then 
create cost function k assigning cost to each edge as follows: 


• edge e labeled +1 fe{ce + A) - fe{ce) 

• edge e labeled -I ^ fe{ce - A) - fe{ce). 

Proposition 4. Assume fe is convex. A minimum cost vertex disjoint set of alternating cycles C 
in F given cost function k is a locally optimal modification minimizing Equation\lj\ 

Proof. Any z can be transformed into a specific C by selecting the edges in F corresponding to 
settings of z (label +1 if ^;e = +1, label —1 if Zg = —1, or no edge if Ze = 0). Conversely, suppose C 
includes both the +1 labeled edge and —1 labeled edge that correspond to a single edge in H. The 
convexity of fe implies that /e(ce + A) — fe{ce) + fe{ce — A) — fe{ce) > 0. Selection of both edges 
implies fe{ce + A) — fe{ce) + fe{ce — A) — feipe) < 0, otherwise neither would have been selected. 
Thus /e(ce + A) — /e(ce) + /e(ce — A) — /e(ce) = 0, and both edges can be removed from C without 


11 






consequence to the objective. Let C be derived from C by removing pairs of +1 labeled edge and 
— 1 labeled edges edge corresponding to a single edge in H. There is a bijection between the space 
of possible 2 ; and the space of possible C. Furthermore, minimizing Equation 14 is equivalent to 
minimizing YleeE (/e(ce + ZgA) — /e(ce)). The result follows. □ 


To identify the minimum cost vertex disjoint set of alternating cycles in the bi-edge-colored 
graph F = {V, E = {E\ E"))), we use the following well known transformation. Create two separate 
graphs on distinct vertices, E' = {V',E') and F" = {V”,E"), by selecting a subset of edges from 
F. Merge F' and F" to create F'" by adding additional edges E'" connecting vertices in E' with 
vertices in E" that originated from the same vertex in F. Let M be a prefect matching in F'". Such 
a matching exists since E'" is a perfect matching. Furthermore, M is a perfect matching in E'" if 
and only if it is composed of a set of edge disjoint alternating cycles in F and a subset of edges 
from E'" matching the remaining unmatched vertices in E'". The set of edge disjoint alternating 
cycles is the minimum cost such set in E. We use Blossom V [7] to calculate the minimum cost 
perfect matching. 


3.3.3 Greedy Algorithm 

We propose the following algorithm for solving the maximum posterior genome mixture problem. 
Given are haploid read depths h. Initialize ^ to a valid genome collection as follows. Set the copy 
number of segment edge n to c^. Set the copy number of the bond edge parallel to segment edge 
n also to Cn- The algorithm proceeds as follows. Given current iteration t, identify the minimum 
cost modihcation A of from the set of all possible modifications A G T^,Z\ 7 ^ 0. If the 
minimum cost modification has cost less than 0, apply it to to create and continue. If 

all modifications have cost 0 , stop iteration. 


4 Results 

4.1 Simulating rearranged genomes 

We developed a principled method of simulating rearranged genomes that fulfilled two important 
criteria. First, the simulated tumour genomes are required to have been produced by a known 
evolutionary history composed of duplication, deletion, and balanced rearrangement events applied 
successively to a initially non-rearranged normal genome. Second, the copy number prohle of the 
simulated tumour genome should be reasonably similar to that of previously observed tumours. 

Naively applying random rearrangements to a normal genome would result in a significant 
number of regions that are homozygously deleted. Such a scenario is unrealistic given that large 
scale homozygous deletion would remove housekeeping genes necessary for the survival of the cell. 
Alternatively, a naively simulated genome could become exceptionally large, an outcome that is 
rarely observed given the presumed burden of replicating such a genome. Thus, we developed a 
re-sampling method for producing realistic rearranged genomes. At each step in the simulation of 
an evolutionary history, we re-sample a swarm of genomes according to a fitness function. Fitness is 
calculated as the multinomial likelihood of the simulated copy numbers given average copy number 
proportions from a set of real tumours. For this study we used copy number proportions measured 
from 7 high grade serous tumours (data not shown). 

To simulate a mixture of related genomes, we first simulate the rearrangement history of the 
common ancestor. We then modify the fitness function to include term that controls the deviation 
between ancestor and descendant. A target deviation is specified as the proportion of the genome 
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with divergent copy number state between ancestor and descendant. The deviation term is the 
squared error between simulated deviation and target deviation, with some user specified scaling 
factor, or variance. 

We simulated 20 mixtures of rearranged tumour genomes. Genomes in each mixture harbored 
50 ancestral and 40 clone specific rearrangements, with an additional 50 false rearrangements. Each 
genome consisted of 1000 segments with randomly sampled lengths totaling 3 x 10® nt. Segment 
lengths as a proportion of genome length sampled from a Dirichlet distribution with concentration 
parameter 1. Proportion genotypable reads uniformly from between 0.05 and 0.2. We assumed the 
samples were composed of 40% normal cells and 2 tumour clones. Target deviation was set at 30%. 
Minor clone proportions were set to 5, 10, 20, and 30% of cells. Read counts were simulated using 
a negative binomial likelihood given segment copy numbers and assuming 40X sequencing]^ 

4.2 Benchmarking learning haploid depth using simulated data 

We used EM to infer the haploid read depths {h parameter) within the context of the HMM version 
of our model for simulated datasets. For 50% of the simulated datasets, minor clone proportion 
predictions are within 5% of the simulated value, and normal proportion predictions are within 2% 
of the simulated value (Figure Q. 

4.3 Benchmarking structure and content prediction using simulated data 

We applied 3 decoding algorithms to the read count data assuming the clone proportions and 
sequencing depth were known. The independent algorithm calculates the maximum likelihood 
copy number state of each segment and then post-hoc assigns copy number to breakpoints. The 
Viterbi algorithm calculates the maximum posterior path through the HMM representation of each 
chromosome, also assigning breakpoint copy number post-hoc. The genomegraph algorithm uses 
the proposed algorithm to simultaneously infer segment and breakpoint copy number. For optimal 
performance, the genomegraph algorithm is initialized with the results of the Viterbi. 

We calculated 3 measures of performance, f-measure of ability to predict breakpoints as present 
versus false, f-measure of ability to predict breakpoints as subclonal, and proportion of segments 
for which the correct copy number is identified. The genomegraph algorithm outperforms the inde¬ 
pendent and Viterbi algorithms by all measures for all but the 5% minor clone mixtures (Figure]^. 
The inability of the independent algorithm to model spatial correlation between adjacent segments 
results in a higher number of spurious copy number transitions and low precision with respect 
to estimation of breakpoint presence and clonality. Precision is higher for the Viterbi due to the 
smoothing properties of the algorithm. However, recall is lower than the genomegraph method, 
since copy number changepoints either do not precisely coincide with respective breakpoints, or are 
smoothed over entirely for copy number changes in low proportion clones. Finally, joint inference 
noticeably improves the accuracy of segment copy number prediction over the current state of the 
art, viterbi inference in an HMM. 


4.4 Comparison with Existing Copy Number Inference Methods 


We compared our genome graph approach to four existing methods for subclonal copy number 
inference in heterogeneous tumour samples: TITAN [6], CloneHD |3], and Theta2.0 mm (see 
supplementary section 6.4). We first simulated 10 normal genomes, including SNP genotypes. 


^total haploid coverage of 0.2 reads (paired) per nucleotide corresponds to a 40X sequence coverage genome, 
lOOXlOObp reads from ~400bp fragments 
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Figure 4: Box plots of inferred clonal fraction for the minor tumour clone (left) and normal 
clone (right). Datasets are grouped by simulated minor clone proportion (0.05, 0.1, 0.2, 
0.4). Normal proportion is 0.4 for all simulations. 


(a) Proportion minor clone in the mixture with normal fixed at 0.4 
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(b) Proportion of normal in the mixture 
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(c) Proportion of genomic segments with divergent (subclonal) copy number 
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Recombination sites were selected at a rate of 20 per 100Mb. For each region between adjacent 
recombination sites we selected a random individual from a phased 1000 genomes reference panel 
[3], and used that individual’s SNP genotypes for that region. For each of the 10 normal genomes, 
we then simulated 4 pairs of tumour clones using the re-sampling method described above, varying 
the proportion of the genome that is divergent. Finally we simulated 4 genome mixtures for each of 
the 40 pairs of tumour clones. For each simulated mixture, we calculated the number of reads for 
each clone in an approximately SOX sequencing dataset, and simulated fragments with mean 300 
and standard deviation 30. We then added added the germline SNPs to lOObp reads from these 
fragments, flipping the genotype of SNPs based on a simulated sequencing error rate of 0.005. 

We compared each tool’s output CNA predictions and mixture prediction to the true simulated 
values using 3 measures 

1. Proportion of segments with correct major/minor copy number for each clone, or total copy 
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number where allele specific copy number was unavailable 

2. Relative error in estimation of normal proportion. 

3. Relative error in estimation of minor clone proportion. 

The results are shown in Figure For our simulated data, our method outperforms the competing 
methods on all measures. In general, failure of the competing methods to correctly predict copy 
number is primarily due to the inability to identify the true mixture. Given an incorrect estimate 
of the normal contamination and minor clone proportion, copy number estimates are unlikely to 
be correct. 

We considered the possibility that other tools may perform poorly on datasets with signifi¬ 
cant amounts of high level amplification. Thus we simulated an additional 40 genome mixtures, 
modifying the target copy number state distribution to produce genomes for which segments with 
more than 4 total copies are very rare. For this simulation we fixed the proportion of the genome 
that is divergent at 0.25, and varied the minor clone proportion in the mixture. As can be seen 
in Figure each tool performs better in some circumstances, but only the genomegraph method 
performs consistently well across different mixtures. 

5 Discussion 

We have developed a novel method for joint inference of genome content and structure. Using a 
comprehensive set of simulated genome mixtures, we have shown that joint inference out-performs 
naive methods with respect to identification of subclonal breakpoints and classification of break¬ 
points as real or false positive. Furthermore, we have shown that inclusion of breakpoints during 
copy number inference provides a modest but consistent improvement in the accuracy of predicted 
segment copy number. Results from a comparison against existing subclonal copy number inference 
tools shows our method out-performs existing methods on the simulated mixtures. 

Our method provides several additional novel contributions. When selecting a segment length, 
a balance must be struck between the need for increased statistical strength provided by longer 
segments, and the additional noise that results from a true copy number transition occurring in 
the middle of a segment. By segmenting at the break-ends of rearrangement breakpoints, we 
attempt to capture a majority of the copy number transitions with our segmentation. We then 
use a medium size segmentation to break up longer unsegmented regions, allowing us to retain 
improved statistical strength, while modeling for a majority of the changepoints. Furthermore, 
haplotype blocks are used in a novel way to improve the accuracy of allele specific read counts. 
The aggregation of allele specific read counts across segments allows us to model the alleles of 
large segments as (approximately) independent, making inference more tractable. Compared to 
copy number inference tools such as Titan, our state space is in some ways more comprehensive, 
allowing for any combination of clone copy number at each segment. 

Though our initial results are promising, significant work remains. The proposed moves of our 
greedy heuristic are not sufficient to escape local optima of significant importance. For instance, a 
breakage fusion bridge cycle would be represented as a loop bond edge in the genome graph. We 
do not include loop bond edges as they would never be selected by a matching based approach. To 
support loops, we could either add dummy segment edges, making each loop a 3 edge alternating 
path, or use a more general matching algorithm such as minimum perfect b-matchings to identify 
optimal moves. Finally, future work may show benefit to inclusion of rearrangement breakpoints 
during learning, providing motivation for development of a method that jointly infers genome 
content, genome structure, and clone structure. 
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6 Supplementary Methods 


6.1 Expectation Maximization Method for Learning h 

To infer the haploid read depth parameter h, we break the dependency between segments connected 
by breakpoints, and rely on a restricted model that includes only the dependencies between segments 
adjacent in the reference genome. We use a similar likelihood function, and then select a tractable 
graphical model (HMM) structure that matches the original problem as closely as possible. Hidden 
states in the HMM correspond to copy number matrices: for segment n, copy number state c is an 
M by 2 matrix, with entry Cmt for the copy number of allele i for clone m. 

The HMM version of our problem requires a finite state space. We restrict the state space 
by imposing a maximum copy number, and a maximum copy number difference between clones. 
Segments with true copy number outside this finite state space are modeled with an additional 
null state 0. Transitions into the null state are penalized. Segments in the null state are free 
to take on any copy number state bn in the inhnite state space of all copy number matrices. 
Thus Xn is dependent on both the copy number prior C and an independent copy number state 
bn- A uniform (improper) prior is placed over bn- In practice, only bn in the neighborhood of 
argmaxjp(x„|h, Cn = b) are considered. The graphical model for the HMM is depicted in Figure]^ 


The full data log likelihood can be expressed as given by Equation 18 where q{b, c) is given by 


Equation [T9| 


logp{X,B,C\h,-) = logp{X\B,C,h,-)+\ogp{B\-)+logp{C\-) 

= EEE I{bn = b,Cn = c) logp(Xnlh, Cn = q{b, c), •) 

n b c 

+EE I{bn = h) \ogp{bn = h) 

n b 

+ ^I(ci = c) log7ri(c) 

c 

N 

+EEE I{Cn-l = C,Cn = c') log A„(c, c') (18) 

n=2 c c’ 


q{b,c) = 



(19) 


We model the probability of a segment having copy number that is higher than the maximum 
copy number in our HMM state space with an exponential distribution over the length of the 
segment, with exponential parameter Aq. Calculate an(c), the probability of high level amplification 


of segment n, as given by Equation 20 


an{c) = \a exp {-XalnKc = 0 )) 


( 20 ) 


Additionally, we model the probability of a segment having copy number that is divergent 
between tumour clones with an exponential distribution over the length of the segment, with 
exponential parameter A^. Calculate the number of alleles with divergent copy number d{c) as 
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given by equation 21, and the probability of divergence Pn{c) as given by equation 22 


d{c) 

Pn{c) 


M M 

I m=2m'=2 
Xd exp {-Xdlnd{c)) 


( 21 ) 

( 22 ) 


Finally, we model the probability of copy number transitions between adjacent segments as an 
exponential distribution over the copy number difference between the segments. The parameter of 
the transition exponential, /3 is set to the same value as the telomere penalty in the combinatorial 
problem. Calculate the number of telomere copies t(c, d) at a transition from a segment with copy 
number d to a segment with copy number c as given by equation 23 and the probability of the 


transition as given by equation 24 


i(c,c') = 


c — c 


(23) 

r/(c,c') = /3exp(-^t(c,c')) (24) 

The prior probability over the copy number of the first segment p{ci = c) is thus given by 


equation 25 and the transition probability from segment n — 1 to segment n is given by equation 


7ri(c) 

= p{ci = c) 



oc Pi{c)ai{c) 

(25) 

An{c,d) 

= p{Cn = C Cn-1 = d) 



OC Pnic)an{c)r]{c, d) 

(26) 


The expected value of the log likelihood function with respect to the conditional distribution 
p{C\X, •), excluding terms independent of /i, is given by Equation 

Q(h,h(*-i)) = Ep(5^^|^^,(*_p^.)[logp(X,S,C|h,.)] 

= = b,Cn = c\Xn, ■) log p{Xn\h, Cn = q{b,c), •) 

n b c 

= '^'^P{Cn = c\Xn,h^^~^\-)logp{Xn\h,Cn = C, •) 

n c^0 

+ ^^p(6„ = b,Cn = <I}\Xn, ■)logp{Xn\h, Cn = 6,-) (27) 

n b 

To calculate the joint-posterior marginal probabilities p{bn = b,Cn = c\xn, ■), and thereby 

calculate Q(/i,(E-step), we use the forwards-backwards algorithm. The graphical model is 
a polytree, thus the forwards-backwards cannot be applied directly. Instead, we model both c„ and 
bn jointly, and apply the forwards-backwards to the merged state space of Cn and bn- 

Typically the state space of the jointly modeled variables would be a cross product of the state 
space of each variable independently. Eor our model, such a state space would be unmanageably 
large. Eortunately a cross product state space can be avoided due to the mutually exclusive effect 
of Cn and bn on the likelihood of Xn, and given that Cn and bn are independent of fen-i- Eor instance, 
when marginalizing over the space of Cn and bn in the forwards pass, part of the marginalization 
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can 


take place analytically (Equation [^. 

p{bn = b,Cn = c\xi-n-l) = p{bn-l = b' , Cn-l = c'\xi-n-l) 


b' c' 


X p{hn = b,Cn = c\bn-l = b', Cn -1 = c') 

^p(c„_i = c'\xi:n-l)p{bn = b,Cn = c\Cn-l = c) 




+ ^p(6n-l = b',Cn-l = %\xi-,n-l)p{bn = b,Cn = c\Cn-l = 0) 


(28) 

The backwards pass can be simplified similarly. In effect, we can consider a state space that 
includes all regular (/ 0) states for Cn, concatenated with the additional states for Entries of 
the transition matrix are given by Equation |29[ 


p{bn = b,Cn = c|c„_i = c) = p{Cn = c|Cn-l = c')p{bn = b) (29) 

We maximize Q{h, with respect to h (M-Step) numerically. The gradient of Q{h, 


used for numerical maximization, can be computed as given by Equations 30 33| 


ag(/i,M‘-i)) 


d 


dhn 


EE p{Cn = c\Xn, ^\ •) ^ - 7 ^ log p(x„fc |/l, C„ = C, •) 


C7^0 


k=l 


d 


+ EE p{bn = b,Cn = 0|Xn,/l(* •) 7 ^ logp{Xnk\h, Cn = b, •) 


n b 


k=l 


(30) 


d 


dh„ 


logp{Xnk\h,Cn,-) = 


7^logp{Xnk\h,Cn,-)) 

^l^nk J \ ) 


dfJ'nk 

Oh^n 

d^nk 

dhffi 


_ ^ 

dhrr 

— ^nm 


Pnlr. 


(31) 

(32) 

(33) 


A negative binomial specific term is given by Equation 34, and the poisson specific term by 

d 


Equation 35 


, /'ll. \ ^nk T ^nk 

logp{Xnk\h,Cn,-) = -^- 

f^nk '^k “r Mn/c 


9f^nk 

log p{Xnk\h,Cn,-) = — -1 

^t-^nk l^nk 

6.2 Estimating the overdispersion parameter r 


(34) 

(35) 


We estimate the overdispersion parameter r offline from segment read counts. We assume that the 
majority of adjacent segments have the same genotype, and thus the same expected read depth 
7 . Under this assumption, we identify the r that maximize the likelihood of the read count data 


(Equation 36) for pairs of adjacent segments i and i + 1 with identical read depth 7 ^. We use 
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gradient descent to find a local optima of the likelihood with respect to both r and 7 *, with partial 
derivates with respect to r and 7 * calculated as given by Equations and respectively. 




d-fi 

d£{r,l,p,'y) 

dr 


N 

2 2z 


E E \ogp{Xn\ln,Pn-,li-,'' 


2 = 1 71 = 22—1 

N 

2 22 

E E log {r{xn + r)) - log(xn!) - log (r(r)) + r log (r) 

2 = 1 71 = 22—1 

-r log (r + In'Ji) + Xn log(/„ 7 i) - Xn log (r + /n 7 i) 

2i 


I'ln ^ Xn Xnln 
4"^ , r + Inli li r + In'^i 

71=22 — 1 

N 

2 22 

E E ip{xn + r) - ip{r) + log (r) + 1 


2 = 1 71=22—1 


- log (r + In'Ji) - 


r + Inli r + In-fi 


(36) 

(37) 


(38) 


6.3 Independence of Segment Read Counts 


Previously, [9] modelled segment read counts as a single draw from a multinomial likelihood. Sup¬ 
pose, in addition to the multinomial, we model the total number of reads T measured by the 
sequencing experiment as a Poisson distributed random variable with unknown mean A. The joint 
likelihood of multinomial distributed read counts Xj and total read count T given proportions tt* 
and expected total read count A can be written as given by Equation [3^ Introduce variables A,- 
such that A = ^ ■ Xj and iTj = Then Equation 


39 


can be rewritten as given by Equation 


which is the likelihood of J poisson distributed independent random variables with means Xj (cy. 
Thus, our use of independent Poisson likelihoods can be seen as equivalent to the multinomial used 
by [9j if we also assume the total read count of the experiment is a Poisson distributed random 
variable. 


p(r,xi, ..,xj|A, 7 ri, .., 7 rj) oc e (39) 

j 

p{xi,...,xj\Xi,...,Xj) oc (40) 

j 

6.4 Parameters used for existing methods 
6.4.1 Theta2.0 

Theta requires an existing segmentation, thus we merged adjacent segments with identical copy 
number states to form a collection of perfect segments. We then counted reads within those 
segments and used those read counts as input to Theta. We used the — NUM_INTERVALS 15 option 
to allow for reasonable running time, and the — FORCE argument to for Theta to run on some of the 
genomes which were sub-optimal candidates for Theta analysis. We then used octave to execute 
the runBAFGaussianModel function to select a single solution from potentially multiple optimal 
solutions. Only the solution with 2 tumour clones was considered, as that is what was simulated. 
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6.4.2 Titan 


As input to Titan, we calculated counts of reads contained within regular lOOObp segments, and 
counts reads supporting the reference and non-reference allele for heterozygous germline SNPs. We 
then ran Titan, without correcting for GC and mappability (as this was not simulated). We used 
multiple initializations, with normal contamination from 0 to 1 in increments of 0.1, and ploidy 
from 1 to 4 in increments of 1. The number of Titan clusters was fixed at 2. We then selected 
the solution with lowest S_Dbw validity index. Since the parameterization for Titan is slightly 
different than for other tools, we used the following formula to convert from estimates of tumour 
clone prevalences ti and t 2 , and normal contamination estimate n, as follows. 


mixture = [n, (1 — n) x t 2 , (1 — n) x \ti — t 2 \] 


(41) 


6.5 CloneHD 

As input to CloneHD, we calculated counts of reads contained within regular lOOObp segments, 
and counts reads supporting the reference and non-reference allele for heterozygous germline SNPs 
(b-allele frequency (BAF) data). We used a series of steps as outlined in run_example. sh. 

1. use filterhd to analyze the normal read depth data for technical read depth modulation 

2. use filterhd to analyze the tumour read depth data to get a benchmark of the log likelihood 

3. use filterhd to analyze the tumour read depth data with the bias estimate from the normal 

4. use filterhd to analyze the tumour BAF data 

5. use clonehd to infer copy number based on tumour read depth and BAF data. 
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Figure 5: Performance of the genomegraph algorithm compared to two breakpoint naive 
approaches, the independent model and and HMM using the viterbi algorithm. For each 
set of plots, one parameter of the simulation was varied. 
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Figure 6: Performance of the genomegraph algorithm compared to three existing methods, 
TITAN, CloneHD, and Theta2.0. Each plot shows performance with one parameter of the 
simulation varied. 
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Figure 7: Performance comparison of the genomegraph algorithm with TITAN, CloneHD, 
and Theta2.0, using a dataset with limited amplified regions. Showed is performance when 
varying the proportion of minor clone in the mixture with normal fixed at 0.4 
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Figure 8: Graphical model of the HMM version of the problem used for parameter inference. 
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